Evaluating Local Community Methods in Networks 






James P. Bagrow 

Department of Physics, Clarkson University, Potsdam NY 13699-5820 USA
(Dated: October 4, 2007) 

We present a new benchmarking procedure that is unambiguous and specific to local community- 
finding methods, allowing one to compare the accuracy of various methods. We apply this to new 
and existing algorithms. A simple class of synthetic benchmark networks is also developed, capable 
of testing properties specific to these local methods. 



PACS numbers: 89.75.Hc, 87.23.Ge, 89.20.Hh, 89.75.-k



I. INTRODUCTION 

The study of complex networks [1, 2, 3] has recently
arisen as a powerful tool for understanding a variety of
systems, such as biological and social interactions [4, 5],
technology communications and interdependencies [6, 7],
and many others. The problem of detecting communities,
subsets of network nodes that are densely connected
amongst themselves while being sparsely connected to
other nodes, has attracted a great deal of interest due to
a variety of applications [8, 9, 10, 11, 12]. Many techniques
have been developed to find these subsets, with a
broad array of costs and associated accuracies [13].

Many community-finding algorithms hinge upon maximizing
a quantity known as Modularity [14, 15], often
defined as:

$$Q = \frac{1}{2M} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2M} \right] \delta(c_v, c_w), \tag{1}$$

where $A$ is the adjacency matrix, $M$ is the total number
of edges, $k_i$ is the degree of vertex $i$, and $\delta(c_v, c_w) = 1$
if nodes $v$ and $w$ are in the same community and zero
otherwise. Thus Q is the fraction of edges found to be 
within communities, minus the expected fraction if edges 
were randomly placed, irrespective of an underlying com- 
munity structure but respecting degree. The second term 
then acts as a null model, and large values of Q indicate 
deviations away from a random network structure. 
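
To make the two terms of the modularity concrete, the following minimal Python sketch evaluates Q by direct summation over node pairs. The function name `modularity` and the dictionary-of-sets graph representation are illustrative assumptions, not part of the original work, and the quadratic double loop is chosen to mirror the formula rather than for speed.

```python
from itertools import product


def modularity(adj, community):
    """Modularity Q of an undirected simple graph.

    adj:       dict mapping each node to the set of its neighbours (assumed format)
    community: dict mapping each node to a community label
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2M equals the sum of degrees
    q = 0.0
    for v, w in product(adj, repeat=2):
        if community[v] != community[w]:
            continue  # delta(c_v, c_w) = 0
        a_vw = 1.0 if w in adj[v] else 0.0
        # observed adjacency minus the degree-respecting null-model expectation
        q += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return q / two_m


# Two triangles joined by a single edge, split into the two triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(adj, labels))
```

For this small graph the sum can be done by hand: 12 of the 14 adjacency entries fall inside communities and the null-model term contributes 98/14 = 7, so Q = (12 - 7)/14 = 5/14. Placing every node in a single community gives Q = 0, the trivial value of the null model.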

Very efficient algorithms have been created utilizing
greedy optimization of Q [16, 17], but any algorithm
using Q must necessarily be a global method, requiring
complete knowledge of the entire network. Meanwhile,
it has been shown [18] that Q is not ideal, and a variety
of other techniques exist [13], but these too generally
require global knowledge. This knowledge isn't available
for certain types of networks, such as the WWW, which 
is simply too large and evolves too quickly to have a fully 
known structure. In these circumstances, one must rely 
on a local method capable of finding a particular commu- 
nity within a network, without knowledge of the struc- 
ture outside of the discovered community. Several local 



methods exist, all of which attempt to find the community
containing a particular starting node [19, 20, 21, 22].
In this work we present a new technique for quantify- 
ing the accuracy of a local method, so that one can de- 
termine how various algorithms perform relative to each 
other. Due to the unique dependence a local method has 
upon its starting node, we also develop a simple set of ad 
hoc benchmark networks, with a generalized degree dis- 
tribution, allowing one to test accuracy when the starting 
node is a hub, for example. We also present a new local 
method, as well as several types of stopping criteria in- 
dicating when an algorithm has best found the enclosing 
community. 



II. LOCAL COMMUNITY DETECTION 
METHODS 

We focus our efforts on two existing algorithms, due to
Clauset [21] and Luo, Wang, and Promislow (LWP) [12],
as well as a new method. Several other local methods exist,
including those due to Flake, Lawrence, and Giles [19]
and Bagrow and Bollt [20], but these are either reliant
on a priori assumptions of network properties (limiting
applicability to specific types of networks, such as the
WWW), or tend to be accurate only when used as part
of a more global method. Other methods (for example,
[23, 25, 32]) concern themselves with local community
structure, but either require global knowledge to first
determine this structure, or are defined locally but do
not provide a definitive partition necessary for evaluation.

All three algorithms begin with a starting node s and 
divide the explored network into two regions: the commu- 
nity C, and the set of nodes adjacent to the community, 
B (each has at least one neighbor in C). At each step, 
one or more nodes from B are chosen and agglomerated 
into C, then B is updated to include any newly discov- 
ered nodes. This continues until an appropriate stopping 
criterion has been satisfied. When the algorithms begin,
C = {s} and B contains the neighbors of s: B = {n(s)}.
See Fig. 1(a).

The Clauset algorithm focuses on nodes inside C that
form a "border" with B: each has at least one neighbor
in B. Denoting this set $C_{\mathrm{border}}$ and focusing on incident
edges, Clauset defines the following local modularity:

$$R = \frac{\sum_{ij} \mathcal{B}_{ij} \, [i \in C]\,[j \in C]}{\sum_{ij} \mathcal{B}_{ij}}, \tag{2}$$

where $\mathcal{B}_{ij}$ is the adjacency matrix comprising only those
edges with one or more endpoints in $C_{\mathrm{border}}$ and $[P] = 1$
if proposition $P$ is true, and zero otherwise. Each node
in B that can be agglomerated into C will cause a change
in R, $\Delta R$, which may be computed efficiently. At each
step, the node with the largest $\Delta R$ is agglomerated. This
modularity R lies on the interval $0 < R < 1$ (defining
$R = 1$ when $|C_{\mathrm{border}}| = 0$) and local maxima indicate
good community separation, as shown in Fig. 2. For a
network of average degree d, the cost to agglomerate |C|
nodes is $O(|C|^2 d)$.

The LWP algorithm defines a different local modularity,
which is closely related to the idea of a weak community [10].
Define the number of edges internal and
external to C as $M_{\mathrm{in}}$ and $M_{\mathrm{out}}$, respectively:

$$M_{\mathrm{in}} = \frac{1}{2} \sum_{ij} A_{ij} \, [i \in C]\,[j \in C], \tag{3}$$

$$M_{\mathrm{out}} = \sum_{ij} A_{ij} \, [i \in C]\,[j \in B]. \tag{4}$$

The LWP local modularity $M_f$ is then:

$$M_f(C) = \frac{M_{\mathrm{in}}}{M_{\mathrm{out}}}. \tag{5}$$

When $M_f > 1/2$, C is a weak community, according
to [10]. The algorithm consists of agglomerating every
node in B that would cause an increase in $M_f$, $\Delta M_f > 0$,
then removing every node from C that would also lead
to $\Delta M_f > 0$, so long as the node's removal does not
disconnect the subgraph induced by C. (Removed nodes
are not returned to B; they are never re-agglomerated.)
Finally B is updated and the process repeats until a step
where the net number of agglomerations is zero. The
algorithm returns a community if $M_f > 1$ and $s \in C$.
Similar to the Clauset method, the cost of agglomerating
|C| nodes is $O(|C|^2 d)$.
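
The quantities entering the LWP modularity are simple edge counts, which the following hedged Python sketch makes explicit. The function name `lwp_modularity` and the dictionary-of-sets graph format are assumptions made for illustration; the sketch only evaluates the ratio of internal to external edges for a given node set and does not implement the full LWP add-and-remove loop.

```python
def lwp_modularity(adj, community):
    """Return (M_in, M_out, M_in / M_out) for a set of nodes.

    adj:       dict mapping each node to the set of its neighbours (assumed format)
    community: set of nodes currently in C
    """
    twice_m_in = 0   # every internal edge is seen from both of its endpoints
    m_out = 0        # every external edge is seen once, from its endpoint in C
    for v in community:
        for w in adj[v]:
            if w in community:
                twice_m_in += 1
            else:
                m_out += 1
    m_in = twice_m_in // 2
    # With no external edges the ratio is unbounded; infinity is used here by convention.
    ratio = m_in / m_out if m_out else float("inf")
    return m_in, m_out, ratio


# Two triangles joined by a single edge; C is the first triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(lwp_modularity(adj, {0, 1, 2}))
```

In this example the triangle has three internal edges and one external edge, so the ratio exceeds both the weak-community threshold of 1/2 and the LWP acceptance threshold of 1.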

Finally, we present a new algorithm, as an illustration
of how simple an effective local method can be. Let
us define the "outwardness" $\Omega_v(C)$ of node $v \in B$ from
community C:

$$\Omega_v(C) = \frac{1}{k_v} \sum_{i \in n(v)} \left( [i \notin C] - [i \in C] \right) \tag{6}$$

$$\phantom{\Omega_v(C)} = \frac{1}{k_v} \left( k_v^{\mathrm{out}} - k_v^{\mathrm{in}} \right), \tag{7}$$

where $n(v)$ are the neighbors of $v$. In other words, the
outwardness of a node is the number of neighbors outside
the community minus the number inside, normalized by
the degree. Thus, $\Omega_v$ has a minimum value of $-1$ if
all neighbors of $v$ are inside C, and a maximum value




FIG. 1: (color online) (a) The community C is surrounded 
by a boundary of explored nodes B. This exploration implies 
an additional layer of nodes that are known only due to their 
adjacencies with B. (b) Two nodes i and j in B, with $\Omega_i = 2/3$
and $\Omega_j = -1$. Moving node j into C will give improved
community structure, compared to moving i. 



of $1 - 2/k_v$, since any $v \in B$ must have at least one
neighbor in C. Since finding a community corresponds to
maximizing its internal edges while minimizing external
ones, we agglomerate the node with the smallest $\Omega$ at
each step, breaking ties at random. See Fig. 1(b).

This method is efficient for the following reasons.
When a node $v \in B$ is moved into C, only the neighbors
of $v$ will have their outwardness altered. For a node
$i \in n(v)$, the change in $\Omega_i$ is just $\Delta\Omega_i = -2/k_i$ since only
a single link can exist between $v$ and $i$. If node $i$ was not
previously in B, it will now have a single edge to C and
$\Omega_i = 1 - 2/k_i$. Calculating $\Omega_i$ at each step thus requires
knowing only $k_i$, which may be expensive (for example,
on the WWW), but needs only be calculated upon the
initial discovery of $i$.

For efficiency, one can maintain a min-heap of the outwardness
of all nodes in B then, at each step, extract
the minimum with cost $O(\log |B|)$, and update or insert
the neighboring $\Omega$'s. For a network with average degree
d, the cost of this updating is $O(d^2 \log |B|)$. This is often
an overestimate, depending on the community structure,
since a node's degree need only be calculated once. Then,
the cost of agglomerating |C| nodes is $O(|C| d^2 \log |B|)$.
The relative sizes of C and B are highly dependent on
the particular network and the current state of the algorithm,
but $|B| \sim |C|$ seems reasonable. A sparse network
with rich community structure would give a cost of
$O(|C| \log |C|)$.
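
The heap-based agglomeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `agglomerate`, the dictionary-of-sets graph format, the fixed number of steps in place of a stopping criterion, and the use of lazy deletion (stale heap entries are skipped rather than updated in place) are all choices made here for brevity.

```python
import heapq
import random


def agglomerate(adj, start, steps, seed=0):
    """Grow a community from `start`, adding the least outward node each step.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    Returns the nodes in the order they were agglomerated.
    """
    rng = random.Random(seed)
    community = {start}
    k_in = {}    # boundary node -> number of its neighbours inside C
    heap = []    # entries (outwardness, random tie-breaker, node); may be stale
    order = [start]

    def push(v):
        k = len(adj[v])
        omega = (k - 2 * k_in[v]) / k          # (k_out - k_in) / k
        heapq.heappush(heap, (omega, rng.random(), v))

    def absorb(v):
        # Only the neighbours of the newly added node change their outwardness.
        for w in adj[v]:
            if w not in community:
                k_in[w] = k_in.get(w, 0) + 1   # outwardness of w drops by 2 / k_w
                push(w)

    absorb(start)
    while heap and len(order) <= steps:
        omega, _, v = heapq.heappop(heap)
        if v in community:
            continue                           # stale: node already agglomerated
        k = len(adj[v])
        if omega != (k - 2 * k_in[v]) / k:
            continue                           # stale: outwardness has since changed
        community.add(v)
        del k_in[v]
        order.append(v)
        absorb(v)
    return order


# Two triangles joined by a single edge, starting from node 0.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(agglomerate(adj, start=0, steps=2))
```

Starting from node 0, the first triangle is absorbed before the bridge is crossed, because node 1 (outwardness 0) and then node 2 (outwardness -1/3 once node 1 is inside) are less outward than anything beyond the bridge. Lazy deletion keeps each heap operation logarithmic at the price of some stale entries; a heap with a decrease-key operation would be an alternative.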

While seeking to agglomerate the least outward nodes
at each step seems natural, it lacks a nicely defined measure
of the quality of the community, analogous to R in
the Clauset agglomeration. To overcome this we simply
track $M_{\mathrm{out}}$ during agglomeration. The smaller this is the
better the community separation, so we expect local minima
in $M_{\mathrm{out}}$ when a community has been fully agglomerated.
In addition, $M_{\mathrm{out}}$ can be easily computed alongside
agglomeration. After agglomerating node $v$, the change
in $M_{\mathrm{out}}$ is just $\Delta M_{\mathrm{out}} = 2 k_v^{\mathrm{out}} - k_v$. As shown in Fig. 3,
$M_{\mathrm{out}}$ provides useful information about a real-world
network's community structure, in this case the amazon.com
FIG. 2: (color online) Comparison between quality measures
for the Clauset algorithm, R, and the method presented here,
$M_{\mathrm{out}}$. Shown is the average of 500 realizations of the 128
node ad hoc networks, for $z_{\mathrm{out}} = 1, 2, \ldots, 6$.
(Horizontal axis: agglomeration step, community size |C|.)

co-purchasing network [41].

Using $M_{\mathrm{out}}$ as a measure of quality is not ideal, however:
it's not normalized, and (like the Clauset modularity)
obtains a trivial value when the entire network has
been agglomerated. The latter is less of an issue for local
methods. More worrisome is the fact that $M_{\mathrm{out}}$ may
also be trivially small when C is small. See Fig. 2 for a
comparison of R and $M_{\mathrm{out}}$. We continue to use $M_{\mathrm{out}}$ for
the sake of simplicity, but more involved measures may
certainly lead to improved results.
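
The incremental update of the external edge count can be checked against a brute-force recount. The sketch below is illustrative only: the names `count_m_out` and `track_m_out` and the dictionary-of-sets graph format are assumptions, and the agglomeration order is supplied by hand rather than produced by an algorithm.

```python
def count_m_out(adj, community):
    """Brute-force count of edges with exactly one endpoint in `community`."""
    return sum(1 for v in community for w in adj[v] if w not in community)


def track_m_out(adj, order):
    """External edge count after each agglomeration, updated incrementally."""
    community = {order[0]}
    current = len(adj[order[0]])       # a lone node has all of its edges external
    history = [current]
    for v in order[1:]:
        k_v = len(adj[v])
        k_out = sum(1 for w in adj[v] if w not in community)
        current += 2 * k_out - k_v     # k_out edges gained, k_v - k_out edges lost
        community.add(v)
        history.append(current)
    return history


# Two triangles joined by a single edge, agglomerated in node order.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
history = track_m_out(adj, [0, 1, 2, 3, 4, 5])
print(history)  # [2, 2, 1, 2, 2, 0]
```

The local minimum at the third step marks the point where the first triangle is complete, and the final zero is the trivial value reached once the whole network has been agglomerated.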



III. STOPPING CRITERIA

After identifying an appropriate agglomeration
scheme, a local method must also be able to appropriately
stop adding nodes. Here we suggest two possible
schemes and will use the techniques and benchmarks
of Sec. IV to compare them. It is important that the
stopping criterion is also local; a criterion that spreads to
the entire network and then finds, e.g., the largest values of
$\Delta M_{\mathrm{out}}$ is no longer a local algorithm.

These stopping criteria are essentially divorced from
the agglomeration schemes of most local algorithms, allowing
one to mix and match to find more accurate methods.
We show this with the Clauset and new methods from
Sec. II. The LWP algorithm already contains a stopping
criterion and we use it unaltered.

A subgraph $C \subset G$ is a strong community when every
node in C has more neighbors inside C than outside [10, 19].
This may be used as a local stopping criterion in the
following way: agglomerate nodes until C becomes, and
then ceases to be, strong. Unfortunately, this can be too
strict, since a single node can terminate the algorithm.
Define a p-strong community as one where this is true
for only a fraction p of nodes in C. Then, one can relax
the condition by lowering p. Multiple values of p can be
used simultaneously, at little cost, and the "best" result
(smallest $M_{\mathrm{out}} > 0$, largest $R < 1$) can be retained as
C. We do this for {p} = {0.75, 0.76, ..., 1}. For specific
details, see Appendix A.

Another stopping criterion is what we refer to as Trailing
Least-Squares. Fitting a polynomial to the plot of
$M_{\mathrm{out}}$ during agglomeration, one can identify the cusp or
inflection point that indicates a community border. This
method is somewhat involved but our benchmarking procedure
shows that it works quite well. See Appendix B.

FIG. 3: (color online) Comparison of a seminal physics text
and a popular DVD (#1 seller at the time of calculation) on
the amazon.com co-purchasing network. Fluctuations in $M_{\mathrm{out}}$
in both items indicate the presence of non-trivial community
structure. The smooth curve is for a 2D periodic lattice of
500 x 500 nodes. (Horizontal axis: agglomeration step,
community size |C|.)



IV. BENCHMARKING 
A. Test graphs 

It has become standard practice to test community al- 
gorithms with synthetic networks that possess a given 
community structure and a parameter to control how 
well separated the communities are. The traditional ex- 
ample is the so-called "ad hoc" networks [3 HH , which 
typically possess 128 nodes divided into four equally
sized communities. Each node has (on average) degree
$z = z_{\mathrm{in}} + z_{\mathrm{out}} = 16$, where $z_{\mathrm{out}}$ is the number of links
a node has to nodes outside its community. A smaller
$z_{\mathrm{out}}$ (and correspondingly larger $z_{\mathrm{in}}$) leads to communities
that are easier to detect.

These ad hoc networks have a sharply peaked degree 
distribution. Since local algorithms are dependent on a 
particular starting node, their accuracy might be affected 
if the starting node is a hub or a leaf [42]. So one would
also like more realistic synthetic networks which possess
a wider degree distribution, such as a power law. To do
this, we propose the following: 

1. Build a graph G of N nodes and M edges, perhaps
using the configuration model and a given degree
distribution. Throughout this work, we use
Barabási-Albert graphs of N = 512 and $m_0 = 8$.

2. Randomly partition the nodes of G into two or more 
groups. These will serve as the "actual" communi- 
ties. We limit ourselves to four equally sized parti- 
tions. 

3. Choose random pairs of edges that are between the 
same two groups and rewire them to be within the 
groups, in such a way that the degree distribution 
is unaltered. 

This rewiring (or switching) technique, replacing edges
(i, j) and (k, l) with edges (i, k) and (j, l), has
been used in the past to destroy the presence of community
structure, allowing for a null model to test for
false positives [39]. Here we do the opposite, and communities
become more sharply separated as the number
of rewirings increases.

Since the partition is random, the initial modularity
$Q_0$ will be very small. As edges are moved within communities,
the first sum in Eq. (1) will grow but the second
term will remain unchanged, since the degree distribution
is unaffected. Each rewired pair adds two edges to the
interior of the communities. Therefore, the modularity of the actual
partition $Q(t)$ after $t$ pairs of edges have been moved is

$$Q(t) = Q_0 + \frac{2t}{M}. \tag{8}$$

Rewiring $M/4$ pairs of edges will give $Q \approx 1/2$, creating
an appreciable amount of community structure in the
previously randomized graph.
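
The rewiring step of the benchmark construction can be sketched as follows. This is a hedged illustration, not the code used for the paper: the function names `rewire_benchmark` and `internal_edges`, the edge-list input format, the bounded number of attempts, and the rejection of moves that would create self-loops or multi-edges are all assumptions. Any simple undirected graph and any node partition can be supplied; a ring lattice stands in for the Barabási-Albert graphs here only to keep the example self-contained.

```python
import random


def internal_edges(edges, group):
    """Number of edges whose two endpoints share a group."""
    return sum(1 for u, v in edges if group[u] == group[v])


def rewire_benchmark(edges, group, t, seed=0, max_tries=100000):
    """Move up to t pairs of between-group edges inside the groups.

    edges: list of (u, v) pairs of a simple undirected graph (assumed format)
    group: dict mapping each node to its planted community label
    Returns the new edge list and the number of pairs actually moved.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    moved = 0
    for _ in range(max_tries):
        if moved == t:
            break
        a, b = rng.sample(range(len(edge_list)), 2)
        i, j = edge_list[a]
        k, l = edge_list[b]
        if group[i] == group[j] or group[k] == group[l]:
            continue                      # both edges must run between groups
        if group[i] != group[k]:
            k, l = l, k                   # orient so that i and k share a group
        if group[i] != group[k] or group[j] != group[l]:
            continue                      # the edges do not join the same two groups
        if i == k or j == l:
            continue                      # would create a self-loop
        new1, new2 = frozenset((i, k)), frozenset((j, l))
        if new1 in edge_set or new2 in edge_set:
            continue                      # would create a multi-edge
        edge_set -= {frozenset((i, j)), frozenset((k, l))}
        edge_set |= {new1, new2}
        edge_list[a], edge_list[b] = (i, k), (j, l)   # every degree is unchanged
        moved += 1
    return edge_list, moved


# Stand-in graph: ring lattice of 40 nodes with links to the next three nodes.
n = 40
edges = [(v, (v + d) % n) for v in range(n) for d in (1, 2, 3)]
group = {v: v % 4 for v in range(n)}      # four equally sized planted groups
new_edges, moved = rewire_benchmark(edges, group, t=10)
gain = internal_edges(new_edges, group) - internal_edges(edges, group)
print(moved, gain, 2 * moved / len(edges))   # the last value is the change in Q
```

Because each accepted move turns two between-group edges into two within-group edges and leaves every degree untouched, the number of internal edges grows by exactly two per move, which is the origin of the $2t/M$ term in the modularity of the planted partition.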



B. Evaluation

Any local method creates a binary partition of the network
into the community itself, $C$, and the remaining
non-community nodes, $\bar{C} = V - C$. In a realistic setting
$V$ is unknown, but synthetic benchmarks allow one
to know the full division. In addition, for a synthetic
benchmark, the true partition $P_R = \{C_R, \bar{C}_R\}$ is already
known, while the found partition $P_F = \{C_F, \bar{C}_F\}$ may
differ.

Traditionally, the accuracy of the found communities
is quantified by the fraction of correctly identified nodes.
This has been shown to have drawbacks [33] and the
binary partitioning of a local algorithm poses further
problems. For example, if the algorithm fails to stop
in time, it has still identified every node in the community
correctly; there are just additional nodes incorrectly
attributed to that community. Should each incorrect
node give a penalty? If the algorithm incorrectly finds
one community of N nodes, when there were actually
K communities of N/K nodes each, one could assign a
$+1/N$ for each correct node and $-1/N$ for each incorrect
node, giving a composite score of $2/K - 1$. This
means that synthetic networks with different K's cannot
be directly compared. While scores could be subsequently
re-normalized to lie between 0 and 1, we propose
an alternative that avoids these problems and is unambiguous.

Following the application introduced in [13], we use
Normalized Mutual Information to measure how
well $P_R$ and $P_F$ correspond to each other:

$$I(P_R, P_F) = \frac{-2 \sum_{i} \sum_{j} X_{ij} \log\left( \frac{X_{ij} N}{X_{i\cdot} X_{\cdot j}} \right)}{\sum_{i} X_{i\cdot} \log\left( \frac{X_{i\cdot}}{N} \right) + \sum_{j} X_{\cdot j} \log\left( \frac{X_{\cdot j}}{N} \right)}, \tag{9}$$

where $X$ is a $2 \times 2$ matrix with $X_{ij}$ being the number
of nodes from real group $i$ that were placed in found
group $j$, $X_{\cdot j} = X_{1j} + X_{2j}$, and $X_{i\cdot} = X_{i1} + X_{i2}$. In
a sense, $I(P_R, P_F)$ is a measure of how much is known
about partition $P_R$ by knowing partition $P_F$, with $I = 1$
corresponding to perfect knowledge, and $I = 0$ to no
knowledge at all.

In general, the confusion matrix $X$ is $N_R \times N_F$ where
$N_R$ and $N_F$ are the number of real and found communities,
respectively. The application of Eq. (9) is a limiting
case corresponding to the binary partitioning inherent to
local algorithms.

In most figures, we have included a "faked" global
method, the Clauset-Newman-Moore (CNM) algorithm [15, 16],
for comparison. This was done by running
CNM to find the partitioning with the highest modularity;
one random community was designated $C$, and the
other communities were grouped together in $\bar{C}$. A local
algorithm is unlikely to match the accuracy of a global
method, as shown.

V. RESULTS AND DISCUSSION



The results of simulations, shown in Figs. 4-9, indicate
the relative accuracies of the various algorithms and
stopping criteria. As shown in Figs. 4 and 7, the LWP
method performs extremely well for clearly separated
communities, with a rapid decrease in accuracy as the
separation blurs.
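
The accuracies discussed in this section are normalized mutual information scores between the true and found binary partitions. As a minimal sketch of that computation, assuming each partition is given simply as the set of nodes placed in the community, the score can be evaluated from the 2 x 2 confusion matrix as below. The function name and the convention of returning 0 when both partitions are trivial (where the ratio is undefined) are assumptions made for this illustration.

```python
from math import log


def normalized_mutual_information(real, found, nodes):
    """NMI between two binary partitions of `nodes`.

    real, found: sets of nodes assigned to the community in each partition;
    every other node in `nodes` belongs to the complement.
    """
    nodes = list(nodes)
    n = len(nodes)
    x = [[0, 0], [0, 0]]                       # confusion matrix X_ij
    for v in nodes:
        x[0 if v in real else 1][0 if v in found else 1] += 1
    row = [x[i][0] + x[i][1] for i in range(2)]   # X_i.
    col = [x[0][j] + x[1][j] for j in range(2)]   # X_.j
    # Empty cells contribute nothing (x log x -> 0), so they are skipped.
    num = -2.0 * sum(
        x[i][j] * log(x[i][j] * n / (row[i] * col[j]))
        for i in range(2) for j in range(2) if x[i][j] > 0
    )
    den = (sum(r * log(r / n) for r in row if r > 0)
           + sum(c * log(c / n) for c in col if c > 0))
    if den == 0:
        return 0.0                             # both partitions trivial (assumed convention)
    return num / den


nodes = range(8)
real = {0, 1, 2, 3}
print(normalized_mutual_information(real, {0, 1, 2, 3}, nodes))  # identical partitions
print(normalized_mutual_information(real, set(nodes), nodes))    # algorithm never stopped
```

An algorithm that fails to stop and swallows the whole network scores zero under this measure, which is the behavior that the fraction of correctly identified nodes does not penalize unambiguously.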

The "best of {p}-strong" (Figs. 6 and 7) and trailing
least-squares (Figs. 6 and 8) stopping criteria first per- 
form at comparable accuracy for both algorithms for the 
128-node ad hoc networks, but the trailing least-squares 
tends to perform better as community distinction blurs. 
Trailing least-squares outperforms {p}-strong in the 512- 
node networks (Fig. 8 vs. Fig. 9), suggesting that the 
size of the community impacts accuracy (which might be 
expected when fitting data). 

Overall, the best of {p}-strong has the least accuracy 
but is also least affected by the degree of the starting 
FIG. 4: (color online) An overall comparison of the various
methods for the 128-node ad hoc networks, averaged over 1000
realizations. The LWP method is by far the most accurate
for low $z_{\mathrm{out}}$, while the trailing least-squares methods offer the
best performance at higher values.

FIG. 5: (color online) Using the "best of {p}-strong" criteria
on the 512-node rewired networks, for {p} = 0.75, 0.76, ..., 1.
Each point is averaged over 500 realizations. The effect of rejecting
any individual p-strong results where $M_{\mathrm{out}} = 0$ ($R = 1$) is
more apparent for these networks, especially for hub nodes.
(Horizontal axis: number of rewirings, t.)

node. Meanwhile, trailing least-squares performs bet- 
ter overall but is more dependent on the starting node. 
The LWP algorithm is also quite accurate overall, though 
trailing least-squares can outperform it when the commu- 
nity separation is less clear. 

The agglomeration schemes presented share many sim- 
ilarities, and a certain amount of "cross-pollination" is 
possible. For example, accuracy may improve if one 
maintains the outwardness of nodes after agglomeration 
and, as per LWP, removes every node from C with positive
outwardness. Another possibility is simply agglomerating
all nodes with the minimum $\Omega$ together, instead




FIG. 6: (color online) A comparison of the trailing least-squares
criteria for both the new algorithm and the Clauset
method. Starting from a hub tends to be the most accurate,
except when the communities are very well separated.
(Horizontal axis: number of rewirings, t.)

FIG. 7: (color online) The LWP algorithm used on the rewired
benchmark networks. LWP performs very well for large numbers
of rewirings, but becomes progressively worse as fewer
edges are moved. Both extremes, hubs and leaves, decrease
overall accuracy. (Horizontal axis: number of rewirings, t.)

of breaking ties. This is not necessarily a trivial differ- 
ence: the agglomeration histories may diverge since the 
sequence of nodes exposed to B can differ. 

There is much room open to develop accurate stopping
criteria. For example, the notion of a weak community
can also be generalized to provide a (perhaps improved)
stopping criterion. As defined, a community is weak when
$M_{\mathrm{in}} > \frac{1}{2} M_{\mathrm{out}}$. This can be generalized by introducing
a parameter to control how strict the constraint is: a
community is p-weak when $M_{\mathrm{in}} > p M_{\mathrm{out}}$. Thus, a weak
community corresponds to $\frac{1}{2}$-weak, and the LWP stopping
criterion is 1-weak. While the introduction of a further
parameter is not ideal, and the lack of performance
of the p-strong criteria versus the trailing least-squares
is not promising, it may still be worth pursuing this and
other, similar stopping criteria. Furthermore, stopping
criteria using LS-sets and k-cores, as mentioned in [10],
may also be worth investigation.

In addition to finding a single community, any local 
method could be easily adapted to find more community 
structure, simply by running the local algorithm multiple 
times (possibly without repeated agglomeration of nodes 
or similar modifications). These quasi-local methods may 
not have the same level of accuracy as a global method 
— agglomerating communities sequentially may lead to 
compounding errors — but it may still be worth pursu- 
ing, even if only as an initialization step for a different 
algorithm. 

There is an implicit assumption, in all these meth- 
ods, that the underlying network is truly undirected. Of 
course, this is not generally true. In the WWW it is easy 
to know what pages an explored web page links to, but 
it is impossible to know how many other pages may link 
to the explored page. These back links are simply dis- 
regarded by the local methods, and it seems a difficult 
problem to overcome, especially when applying a quasi- 
local method and back links continue to be discovered as 
more communities are found. One possible way to overcome
this is to maintain $\Omega_v$ after agglomeration, then go
through all the found communities, remove nodes with,
say, $\Omega > 0$, then re-agglomerate them into the community
with the smallest outwardness. Another idea that has been
suggested is to use a global index, such as a search engine,
to list all the back links. It seems that in a different con- 
text, such as a partially explored social network, one has 
no choice but to ignore these back links until they are 
discovered, then adjust the results accordingly. 



VI. CONCLUSIONS 

Much recent work has been applied to the problem of 
finding communities in complex networks. In this pa- 
per, we have focused on the idea of finding a particular 
community inside of a network without relying on global 
knowledge of the entire network's structure, knowledge 
that is unavailable in a variety of areas. We have in- 
troduced a new and very simple local method, with a 
running time of $O(|C| \log |C|)$. Several types of stopping
criteria have been introduced, which can be used in con- 
junction with different agglomeration schemes. 

Using Normalized Mutual Information, we have in- 
troduced a simple and unambiguous means of quanti- 
fying the accuracy of a local algorithm when applied to a 
synthetic network with pre-defined community structure. 
Synthetic networks with generalized degree distributions 
have been used to allow one to test the impact of the 
starting node's degree, something not possible with ex- 
isting ad hoc networks. 

These techniques have been applied to compare the 
accuracy of a variety of agglomeration schemes and stop- 



ping criteria and we feel they will be of great use when 
testing newly designed local algorithms. The fact that 
multiple stopping criteria and algorithms can perform 
with comparable accuracy shows that the community 
problem is ill-posed to the point of requiring heuristic 
methods, and thus it is worth using an evaluation scheme 
to compare and contrast alternative methods. 



APPENDIX A: STRONG COMMUNITIES

As per [10, 19], a subgraph $C \subset G$ is a strong community
(denoted "ideal" in [19]) when every node in C
has more neighbors inside C than outside:

$$k_i^{\mathrm{in}}(C) > k_i^{\mathrm{out}}(C), \quad \forall i \in C. \tag{A1}$$

This local quantity allows for a very simple, natural stopping
criterion: agglomerate nodes until the community becomes
strong then, at each agglomeration step, check $k^{\mathrm{in}}$
and $k^{\mathrm{out}}$ for the newly chosen node and stop agglomerating
if the community would cease to be strong. If
C never becomes strong, the algorithm won't terminate,
indicating a possible lack of community structure in the
explored region of the network.

As shown in Fig. 8, this "strong to not" criterion works
well for sharply separated communities, but tends to fail
as the contrast decreases. In a sense, a strong community
is too strong of a requirement: as the distinction between
communities blurs, some nodes must fail Eq. (A1),
despite probable membership in C.

We generalize the notion of a strong community in the
following way. A community is p-strong if Eq. (A1)
holds, not for all, but only a fraction p (or more) of the
nodes:

$$\sum_{i \in C} \left[ k_i^{\mathrm{in}}(C) > k_i^{\mathrm{out}}(C) \right] \geq p\,|C|. \tag{A2}$$

Equations (A1) and (A2) are equivalent when p = 1,
while the requirement becomes increasingly lenient as p
decreases. This allows one to tune the sensitivity by varying
p. See Fig. 9.

An additional benefit of Eq. (A2) is that multiple values
of p can be used simultaneously [44], since a community
that is $p_1$-strong is also $p_2$-strong ($p_1 > p_2$). More
specifically, for the actual fraction $p_{CS}$,

$$p_{CS} = \frac{1}{|C|} \sum_{i \in C} \left[ k_i^{\mathrm{in}}(C) > k_i^{\mathrm{out}}(C) \right], \tag{A3}$$

C is p-strong for all $p \leq p_{CS}$, and not p-strong for all
$p > p_{CS}$.
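
The fraction of strong nodes is cheap to evaluate, as the following hedged Python sketch shows. The names `strong_fraction` and `is_p_strong` and the dictionary-of-sets graph format are assumptions made for illustration; the sketch recomputes the fraction from scratch, whereas an implementation running alongside agglomeration would presumably update it incrementally.

```python
def strong_fraction(adj, community):
    """Fraction of nodes in `community` with more neighbours inside than outside."""
    strong = 0
    for v in community:
        k_in = sum(1 for w in adj[v] if w in community)
        k_out = len(adj[v]) - k_in
        if k_in > k_out:
            strong += 1
    return strong / len(community)


def is_p_strong(adj, community, p):
    """True when at least a fraction p of the nodes satisfy the strong condition."""
    return strong_fraction(adj, community) >= p


# Two triangles joined by a single edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(strong_fraction(adj, {0, 1, 2}))      # the triangle is strong
print(strong_fraction(adj, {0, 1, 2, 3}))   # node 3 now violates the condition
```

Since a single evaluation of the fraction answers the p-strong question for every p at once, testing a whole set of p values costs no more than testing one.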

To use, simply choose a set of appropriate parameters,
$\{p_1, p_2, \ldots\}$, perform the local algorithm, and maintain
the state of C as each $p_i$ stopping criterion is satisfied.
One can further use a quality value, such as $M_{\mathrm{out}}$ or $R$,
and choose the best corresponding $C_i$ (in this case, that
with the smallest $M_{\mathrm{out}}$ or largest $R$ [45]). This "best
of {p}" stopping criterion does not entirely negate the
introduction of a new parameter; choosing p too small
(e.g. p = 0.1) can lead to stopping very early. For this
work, we use {p} = {0.75, 0.76, ..., 1.0}, but this may be
worth further exploration. See Figs. 4 and 5.

In addition to strong communities, weak communities
have been defined [10]. A community is weak when
$M_{\mathrm{in}} > \frac{1}{2} M_{\mathrm{out}}$. We have found the usage of a "weak-to-not"
stopping criterion to be problematic. The impact of
a single agglomeration is so small that the community
will blissfully continue to grow, far past the appropriate
stopping point. Just as the strong stopping criterion is too
strong, a weak stopping criterion is too weak. See Sec. V
for further ideas, however.

FIG. 8: (color online) The "strong to not" and trailing least-squares
stopping criteria for the 128-node ad hoc networks
using the Clauset method and the new algorithm presented
here. Each point is averaged over 1000 realizations.

FIG. 9: (color online) Comparison of various p-strong stopping
criteria for the 128 node ad hoc networks using the new
algorithm shown in Sec. II.

APPENDIX B: TRAILING LEAST-SQUARES

Inspired by plots of $R$ and $M_{\mathrm{out}}$, and in an effort to
increase accuracy when community structure is less favorable,
we propose another stopping criterion, based on
fitting a polynomial to $M_{\mathrm{out}}$ (or $R$) to find local
minima/maxima. Suppose $n$ nodes have been agglomerated;
fit $y = ax^2 + bx + c$ to the first $n - 3$ values of $M_{\mathrm{out}}$.
Then extrapolate $y$ to points $n - 2$, $n - 1$, $n$ and test the
following:

1. the parabola opens downward, $a < 0$, and

2. $M_{\mathrm{out}}(i) > y(i)$ for $i = n, n - 1, n - 2$, and

3. $n - 3 > -b/2a$, and

4. $M_{\mathrm{out}}(n) > M_{\mathrm{out}}(n - 1) > M_{\mathrm{out}}(n - 2)$.

If all are satisfied, stop agglomerating (and remove the
final three nodes).

As shown in Fig. 8's inset, when you pass the border
of the community, $M_{\mathrm{out}}$ will start to increase, while the
parabola, unaware of the next three values, continues
downward. This works whether the minimum is a cusp or
just an inflection point, so one need not resort to testing
first versus second differences in $M_{\mathrm{out}}$, etc. The fitting
also provides a degree of smoothing.

This criterion is somewhat involved and has several
semi-arbitrary factors: one could extrapolate to a different
number of points, relax some of the constraints,
fit a different order polynomial, continue fitting until the
criterion ceases to be satisfied, etc. Our results indicate
that this criterion as chosen works well, but further refinement
is certainly possible. We also use this criterion by
fitting a line to $R$ from the Clauset method, since Eq. (2)
tends to grow linearly in the first community. Both fits
have similar accuracy, as shown in Fig. 8.
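
The four trailing least-squares tests of Appendix B can be sketched in plain Python as follows. This is an illustrative reading of the criterion, not the author's code: the names `fit_parabola` and `should_stop`, the solution of the normal equations by Gauss-Jordan elimination, and the requirement of at least six recorded values (three to fit, three held out) are assumptions made here.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = a x^2 + b x + c via the 3 x 3 normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]                  # power sums of x
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]  # moments of y
    m = [[s[4], s[3], s[2], t[2]],      # augmented matrix for unknowns (a, b, c)
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for col in range(3):                # Gauss-Jordan elimination with pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def should_stop(m_out):
    """Apply the four tests to the history m_out[0..n-1] (step i is m_out[i-1])."""
    n = len(m_out)
    if n < 6:                           # assumed minimum: 3 fitted + 3 held-out values
        return False
    xs = list(range(1, n - 2))          # steps 1 .. n-3
    a, b, c = fit_parabola(xs, m_out[:n - 3])
    if a >= 0:
        return False                    # test 1: parabola must open downward

    def y(x):
        return a * x * x + b * x + c

    above = all(m_out[i - 1] > y(i) for i in (n - 2, n - 1, n))      # test 2
    past_vertex = n - 3 > -b / (2 * a)                               # test 3
    rising = m_out[n - 1] > m_out[n - 2] > m_out[n - 3]              # test 4
    return above and past_vertex and rising


# A history that falls along a downward parabola, then turns upward.
falling = [-x * x + 2 * x + 60 for x in range(1, 8)]   # 61, 60, 57, 52, 45, 36, 25
print(should_stop(falling + [26, 30, 40]))             # border has been passed
print(should_stop([20, 18, 16, 14, 12, 10, 8, 6, 4, 2]))
```

In the first history the three held-out values rise while the extrapolated parabola keeps falling, so all four tests pass; in the second, the external edge count is still decreasing and the criterion does not fire. When the criterion fires, the final three nodes would be removed from the community, as stated in the appendix.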